Batch, Off-Policy and Model-Free Apprenticeship Learning
Authors
Abstract
This paper addresses the problem of apprenticeship learning, that is, learning control policies from demonstrations by an expert. An efficient framework for this is inverse reinforcement learning (IRL). Based on the assumption that the expert maximizes a utility function, IRL aims at learning the underlying reward from example trajectories. Many IRL algorithms assume that the reward function is linearly parameterized and rely on the computation of some associated feature expectations, which is done through Monte Carlo simulation. However, this requires full trajectories for the expert policy as well as at least a generative model for intermediate policies. In this paper, we introduce a temporal difference method, namely LSTD-μ, to compute these feature expectations. This allows apprenticeship learning to be extended to a batch and off-policy setting.
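The central computational object here is the feature expectation μ^π(s) = E[Σ_t γ^t φ(s_t) | s_0 = s], and each component μ_i satisfies a Bellman equation in which the i-th feature φ_i plays the role of the reward. The sketch below shows the batch LSTD solve this suggests, in the spirit of LSTD-μ; it is state-only for brevity (the paper's setting can involve state-action features), and the function names and synthetic data are illustrative stand-ins, not the paper's implementation.

import numpy as np

def lstd_mu(states, next_states, psi, phi, gamma=0.99, reg=1e-6):
    """Batch LSTD estimate of feature expectations mu(s) = E[sum_t gamma^t phi(s_t)].

    Runs ordinary LSTD once per component of phi, sharing the same A matrix:
    feature phi_i plays the role of the reward in the Bellman equation
    mu_i(s) = phi_i(s) + gamma * E[mu_i(s')].
    """
    Psi = np.array([psi(s) for s in states])            # (N, d) basis at s
    Psi_next = np.array([psi(s) for s in next_states])  # (N, d) basis at s'
    Phi = np.array([phi(s) for s in states])            # (N, k) reward features

    A = Psi.T @ (Psi - gamma * Psi_next)                # (d, d) LSTD matrix
    B = Psi.T @ Phi                                     # (d, k), one column per feature
    W = np.linalg.solve(A + reg * np.eye(A.shape[0]), B)
    return lambda s: W.T @ psi(s)                       # mu_hat(s), shape (k,)

# Illustrative use on synthetic 1-D transitions (all names here are made up):
rng = np.random.default_rng(0)
S = rng.uniform(0, 1, size=200)
S_next = np.clip(S + rng.normal(0, 0.1, size=200), 0, 1)  # stand-in for sampled transitions
psi = lambda s: np.array([1.0, s, s**2])   # basis for the mu approximation
phi = lambda s: np.array([s, 1.0 - s])     # reward features
mu_hat = lstd_mu(S, S_next, psi, phi)
print(mu_hat(0.5))

Because every feature component shares the same matrix A, one linear solve against a multi-column right-hand side evaluates all k components at once, and only sampled transitions (s, s′) are required: no full trajectories and no generative model, which is what makes the batch, off-policy extension possible.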
Similar Papers
Model-Free Imitation Learning with Policy Optimization
In imitation learning, an agent learns how to behave in an environment with an unknown cost function by mimicking expert demonstrations. Existing imitation learning algorithms typically involve solving a sequence of planning or reinforcement learning problems. Such algorithms are therefore not directly applicable to large, high-dimensional environments, and their performance can significantly d...
Learned Prioritization for Trading Off Accuracy and Speed
Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforceme...
Batch-to-Batch Iterative Learning Control of a Fed-batch Fermentation Process Using Incrementally Updated Models
Batch-to-batch iterative learning control of a fed-batch fermentation process using batchwise linearised models identified from process operation data is presented in this paper. Due to model-plant mismatches and the presence of unknown disturbances, an off-line calculated control policy may not be optimal when implemented on the real process. The repetitive nature of batch process allows informati...
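For readers unfamiliar with the batch-to-batch setting, the generic iterative learning control update this abstract builds on is u_{k+1} = u_k + L e_k: the input profile for the next batch is corrected using the tracking error observed in the last one. The sketch below illustrates only this generic loop with a toy lifted plant and a simple model-inverse gain; it does not reproduce the paper's incrementally updated linearised models.

import numpy as np

# Minimal batch-to-batch ILC sketch: u_{k+1} = u_k + L @ e_k, where e_k is the
# tracking error of batch k. The "plant" G and the learning gain L below are
# toy stand-ins, not the identified models of the paper.
rng = np.random.default_rng(1)
T = 20                                        # samples per batch
G = np.tril(rng.uniform(0.05, 0.15, (T, T)))  # toy lifted plant: y = G u + noise
y_ref = np.linspace(0.0, 1.0, T)              # desired end-of-batch trajectory
u = np.zeros(T)
L = 0.5 * np.linalg.pinv(G)                   # simple model-inverse learning gain

for k in range(10):
    y = G @ u + rng.normal(0, 0.01, T)        # run one batch with disturbance
    e = y_ref - y                             # batchwise tracking error
    u = u + L @ e                             # correct the input profile for next batch
    print(f"batch {k}: RMS error = {np.sqrt(np.mean(e**2)):.4f}")

The error shrinks geometrically from batch to batch down to the noise floor, which is the sense in which the repetitive nature of the process can be exploited despite model-plant mismatch.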
Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is the high sample complexity of such methods. Unbiased batch policy-gradient methods offer stable learning, but at the cost of high variance, which often requires large batches, while TD-style methods, such as off-policy act...
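The abstract is cut off, but the variance problem it describes is concrete. Q-Prop's core device is a control variate: subtract from the Monte Carlo estimator a correlated term whose contribution to the gradient is known analytically, then add that analytic part back. The toy demo below shows this general trick on a one-dimensional score-function gradient; f, the baseline slope c, and the Gaussian policy are illustrative stand-ins, not Q-Prop's actor-critic construction.

import numpy as np

# Control-variate trick behind estimators like Q-Prop: for a ~ N(theta, 1),
# estimate d/dtheta E[f(a)] with the score-function estimator, subtract a
# correlated linear baseline g whose gradient is known in closed form, and
# add that gradient back analytically. f, c, theta are illustrative.
rng = np.random.default_rng(2)
theta, n = 1.0, 5000
a = rng.normal(theta, 1.0, n)
score = a - theta                 # d/dtheta log N(a; theta, 1)

f = lambda a: a**2                # unknown "return"
c = 2.0 * theta                   # baseline slope, held fixed w.r.t. theta
g = lambda a: c * a               # analytic surrogate correlated with f
grad_Eg = c                       # d/dtheta E[c * a] = c, in closed form

plain = score * f(a)                          # vanilla score-function samples
cv = score * (f(a) - g(a)) + grad_Eg          # control-variate samples
print("plain   :", plain.mean(), "+/-", plain.std() / np.sqrt(n))
print("with CV :", cv.mean(), "+/-", cv.std() / np.sqrt(n))

Both estimators are unbiased for the true gradient 2*theta, but the control-variate version has markedly lower variance, which is the property that lets such methods use smaller batches.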
Adaptive Batch Size for Safe Policy Gradients
PROBLEM
• Monotonically improve a parametric Gaussian policy π_θ in a continuous MDP, avoiding unsafe oscillations in the expected performance J(θ).
• Episodic policy gradient (see the sketch below):
– estimate ∇̂θ J(θ) from a batch of N sample trajectories.
– θ′ ← θ + Λ∇̂θ J(θ)
• Tune the step size α and the batch size N to limit oscillations. Not trivial:
– Λ: trade-off with speed of convergence (→ adaptive methods).
– N: trade-off...
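As referenced in the bullets above, the loop itself is easy to state. Below is a minimal sketch of the episodic policy-gradient iteration with α and N exposed as the two knobs to tune; the one-step Gaussian "MDP" and the reward are toy stand-ins, not the poster's benchmarks.

import numpy as np

# Minimal episodic policy gradient: estimate grad J(theta) from a batch of
# N sampled episodes, then take a step of size alpha. alpha and N are exactly
# the two quantities the poster proposes to tune jointly.
rng = np.random.default_rng(3)
theta, sigma = -2.0, 0.5
alpha, N = 0.05, 100                      # step size and batch size (the knobs)
reward = lambda a: -(a - 1.0) ** 2        # unknown to the learner; optimum at a = 1

for it in range(200):
    a = rng.normal(theta, sigma, N)       # batch of N one-step episodes
    score = (a - theta) / sigma**2        # grad_theta log pi_theta(a)
    grad_hat = np.mean(score * reward(a)) # REINFORCE estimate of grad J(theta)
    theta += alpha * grad_hat             # gradient-ascent update
print("final mean action:", theta)        # should approach 1.0

With small N the estimate grad_hat is noisy, and a large α amplifies that noise into oscillations of J(θ); choosing the batch size jointly with the step size so that each update improves performance with high probability is precisely the problem the poster addresses.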